XGBoost | GridSearchCV
This page documents the README for the machine learning model I deployed for predicting student scores. Unfortunately, I do not have permission to disclose the dataset. The instructions on how to use the functions built around the model (data engineering, encoding and imputation based on YAML file configurations, hyperparameter tuning, model validation, model prediction) are detailed here, along with the set-up and overall architecture of the software. The code can be found on my GitHub.
📦AIAP_ASSESSMENT
 ┣ 📂data
 ┃ ┣ 📜score.csv
 ┃ ┣ 📜score.db
 ┃ ┗ 📜score_processed.csv
 ┣ 📂logs
 ┃ ┣ 📜gscv_test.yml
 ┃ ┗ 📜gscv_test_log.yml
 ┣ 📂src
 ┃ ┣ 📜config.yml
 ┃ ┣ 📜grid_search.py
 ┃ ┣ 📜main.py
 ┃ ┣ 📜make_prediction.py
 ┃ ┗ 📜make_validation.py
 ┣ 📜eda.ipynb
 ┣ 📜README.md.txt
 ┣ 📜requirements.txt
 ┗ 📜run.sh
Before attempting to execute the pipeline, complete the following set-up procedures for your terminal of choice:
Set-up for Git Bash
Note: If Anaconda is being used to run Python and the 'python' command cannot be found by Git Bash, run this command to append Anaconda's python.exe directory to the PATH: 'PATH=$PATH:/D/ANACONDA/'
Set-up for Linux
Once the set-up is complete, open the 'config.yml' file found in the 'src' folder. This configuration file contains all configurations for the machine learning pipeline and its functions, including the parameters used by the models and the data processing. The full breakdown of the file is as seen below:
xgbr is an abbreviation for XGBRegressor() and 'xgbr_parameters' is a dictionary which contains the parameters for configuring an instance of the model. The parameters are accessed when you set 'use_configured' to 'y' when running 'python make_validation.py'.
xgbr_parameters:
objective: "reg:squarederror"
learning_rate: 0.2
min_split_loss: 0.01
max_depth: 7
min_child_weight: 1.0
subsample: 1
reg_lambda: 1.0
reg_alpha: 0
scale_pos_weight: 1.0
max_bin: 256
booster: "gblinear"
max_delta_step: 0
verbosity: 2
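As an illustration, here is a minimal sketch (not the actual pipeline code) of how a block such as 'xgbr_parameters' could be loaded from 'config.yml' and unpacked into a model; the same pattern applies to 'xgbc_parameters' and XGBClassifier().

```python
# Minimal sketch, assuming PyYAML and xgboost are installed and the
# script is run from the repository root; not the exact pipeline code.
import yaml
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    config = yaml.safe_load(f)

# Unpack the dictionary straight into the model constructor.
model = XGBRegressor(**config["xgbr_parameters"])
print(model.get_params()["learning_rate"])  # 0.2, per the config above
```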
xgbc is an abbreviation for XGBClassifier() and 'xgbc_parameters' is a dictionary which contains the parameters for configuring an instance of the model. The parameters are accessed when you set 'use_configured' to 'y' when running 'python make_validation.py'.
xgbc_parameters:
objective: "multi:softprob"
learning_rate: 0.2
min_split_loss: 0.01
max_depth: 7
min_child_weight: 1.0
subsample: 1
reg_lambda: 1.0
reg_alpha: 0
scale_pos_weight: 1.0
max_bin: 256
booster: "gbtree"
max_delta_step: 0
verbosity: 2
feature_parameters is a setting for the data processing in the pipeline: set 'include' to True (boolean) to include engineered features, or to False (boolean) to exclude them.
feature_parameters:
include: True
impute_parameters is a setting for the data processing in the pipeline: set 'impute_type' to 'iter' to use iterative imputation for NaN data, or to 'simp' to use simple (mean) imputation for NaN data.
impute_parameters:
impute_type: "simp"
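For illustration, a minimal sketch of how 'impute_type' could switch between the two imputers (the actual pipeline implementation may differ):

```python
# Minimal sketch, assuming scikit-learn; IterativeImputer is still
# experimental and requires the explicit enable import.
import yaml
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

with open("src/config.yml", "r") as f:
    impute_type = yaml.safe_load(f)["impute_parameters"]["impute_type"]

if impute_type == "iter":
    imputer = IterativeImputer(random_state=0)  # iterative imputation
else:                                           # "simp"
    imputer = SimpleImputer(strategy="mean")    # simple mean imputation
# X_imputed = imputer.fit_transform(X)          # X: feature matrix placeholder
```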
grid_search_parameters is a dictionary containing grid search CV (cross validation) parameters for different models. Currently, the included models are XGBRegressor and XGBClassifier. Each model has a dictionary within the grid_search_parameters dictionary which will be accessed when you select that specific model to run grid search CV on. Specify the different values in the lists below; grid search forms the cross product of all parameter values and then tests each combination. The parameters are accessed when you run 'python grid_search.py'.
Note: Grid search cross validation can take a long time; set verbose (optional argument) to 3 to receive updates on its progress.
grid_search_parameters:
xgbr:
learning_rate: [0.05]
min_split_loss: [0]
max_depth: [20]
min_child_weight: [0]
subsample: [1.0]
reg_lambda: [1.0]
reg_alpha: [0]
scale_pos_weight: [0.05, 0.1]
max_bin: [256]
booster: ["gbtree"]
max_delta_step: [0]
n_estimators: [200]
xgbc:
learning_rate: [0.1, 0.2]
min_split_loss: [0.01]
max_depth: [7]
min_child_weight: [1.0]
subsample: [0.9]
reg_lambda: [1.0]
reg_alpha: [0]
scale_pos_weight: [1.0]
max_bin: [256]
booster: ["gbtree"]
max_delta_step: [0]
n_estimators: [200]
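To make the cross-multiplication concrete, here is a minimal sketch of how the 'xgbr' grid could feed scikit-learn's GridSearchCV; 'X_train'/'y_train' are placeholders and this is not the exact code in 'grid_search.py'.

```python
# Minimal sketch, assuming scikit-learn and xgboost are installed;
# X_train and y_train are placeholders for the processed score data.
import yaml
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    param_grid = yaml.safe_load(f)["grid_search_parameters"]["xgbr"]

search = GridSearchCV(
    estimator=XGBRegressor(objective="reg:squarederror"),
    param_grid=param_grid,            # lists are cross-multiplied into a grid
    scoring="neg_mean_squared_error",
    cv=5,
    n_jobs=-1,
    verbose=3,                        # progress updates for long searches
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```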
validation_parameters is a dictionary containing parameters which will be used for repeated K-Fold validation for model validation when you run 'python make_validation.py'.
validation_parameters:
n_splits: 5
n_repeats: 2
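A minimal sketch of how these two values could drive scikit-learn's RepeatedKFold (placeholder data; not the exact code in 'make_validation.py'):

```python
# Minimal sketch, assuming scikit-learn; X and y are placeholders for
# the processed features and labels.
import yaml
from sklearn.model_selection import RepeatedKFold, cross_val_score
from xgboost import XGBRegressor

with open("src/config.yml", "r") as f:
    vp = yaml.safe_load(f)["validation_parameters"]

cv = RepeatedKFold(n_splits=vp["n_splits"], n_repeats=vp["n_repeats"],
                   random_state=0)    # 5 folds x 2 repeats = 10 scores
# scores = cross_val_score(XGBRegressor(), X, y,
#                          scoring="neg_mean_squared_error", cv=cv)
```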
prediction_parameters is a dictionary containing parameters which will be used for prediction when you run 'python make_prediction.py'.
Note: If any parameter's value is unknown, input a best estimate of the population average.
All inputs are to be of type float or str (add .0 if the value is meant to be an integer).
prediction_parameters:
age: 16.0
student_id: C130
number_of_siblings: 2.0
n_male: 0.0
n_female: 3.0
hours_per_week: 10.0
attendance_rate: 10.0
sleep_time: '22:00'
wake_time: '6:00'
direct_admission: 'No'
CCA: Arts
learning_style: Visual
gender: Female
tuition: 'Yes'
mode_of_transport: "walk"
bag_color: rainbow
final_test: unknown
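A minimal sketch of how this block could be turned into a single-row input for prediction; the real script also applies the same encoding/imputation used in training, which is omitted here:

```python
# Minimal sketch, assuming pandas; preprocess() is a hypothetical
# stand-in for the pipeline's actual encoding/imputation steps.
import pandas as pd
import yaml

with open("src/config.yml", "r") as f:
    params = yaml.safe_load(f)["prediction_parameters"]

row = pd.DataFrame([params])   # one student, columns in config order
# row = preprocess(row)        # hypothetical: apply the training-time transforms
# score = model.predict(row)   # model: a fitted XGBRegressor/XGBClassifier
```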
Details on the XGBoost parameters are available here: XGBoost Documentation
Navigate to the base directory containing 'run.sh' in a terminal and key in 'bash run.sh'. The database will be downloaded, the packages required for running the scripts will be installed, and additional .csv files containing processed data will be generated to support the scripts. A series of short tests will also be run on the different Python functions that can be called through the terminal, introducing some of the utility of the pipeline while probing for errors.
Note: All Python scripts were written and tested using Python 3.8 in an Anaconda environment.
# Running the run.sh on terminal
# In the Terminal, navigate to the directory containing 'run.sh'
# 'chmod u+x run.sh' marks the shell script as executable and grants permission for 'run.sh' to install packages.
chmod u+x run.sh
# 'bash run.sh' executes the shell file. The bash file will begin importing libraries and some tests immediately.
bash run.sh
# If you encounter any ImportErrors (e.g. 'ImportError: DLL load failed while importing qhull: The specified module could not be found.'), uninstall the package from your environment and change the version of the package in 'requirements.txt' to the latest version available online before running 'bash run.sh' again.
Enter the 'src' directory through Git Bash to gain access to the following scripts (type -h for more info on the arguments). The random state of all functions has been kept constant; hence, for a specific train-test split, simply keep the ratio the same to keep the datasets the same. Each function will be explained in detail in the following subsections.
Specify the parameters for grid search cross validation to run a search on. All parameters can be found in 'config.yml' under 'grid_search_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (float will indicate proportion allocated to training set) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-filename | input the log filename e.g. 'xgbr1'; logged files contain the best parameters and corresponding score for each run and are stored in the 'logs' folder as '.yml' files |
-cv | input the number of folds of cross validation to use during optimization, default=5 |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python grid_search.py xgbr neg_mean_squared_error regression 0.8 -filename cheese3
Specify the parameters to run a K-Fold validation on a specified model. The configured model parameters can be found in 'config.yml' under 'modelname_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (float will indicate proportion allocated to training set) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-filename | input the log filename e.g. 'xgbr1'; logged files contain the best parameters and corresponding score for each run and are stored in the 'logs' folder as '.yml' files |
-cv | input the number of folds of cross validation to use during optimization, default=5 |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-use_configured | use configured model parameters ('y') or use default parameters ('n'), default='y' |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python make_validation.py xgbr neg_mean_squared_error regression 0.8 -use_configured n
Specify the parameters to run a prediction or test using a specified model. The input parameters for prediction can be found in 'config.yml' under 'prediction_parameters'.
positional arguments | details/explanation |
---|---|
estimator | input the specific model for parameter optimization (e.g. xgbr,xgbc) |
scoring | input the specific scoring metric for parameter optimization, full list available at: sklearn parameters |
model_type | input the model type (regression/classification) |
ratio | input the ratio for train-validation split (should match all previous ratios used) |
optional arguments | details/explanation |
---|---|
-h, --help | show this help message and exit |
-testing | input your choice, testing or predicting (t/p), default='t' |
-n_jobs | input number of processors to use for search (-1 for max, else specify int) |
-use_configured | use configured model parameters ('y') or use default parameters ('n'), default='y' |
-verbose | input verbosity level (ascending verbosity from 1-3), default=0 |
# Sample command to run on terminal
python make_prediction.py xgbr neg_mean_squared_error regression 0.9 -use_configured n -testing p
The full feature engineering details can be found in 'eda.ipynb'; this section should be a quick summary. The following engineered features were included:

Engineered Feature | Logic |
---|---|
sleep_hours based on sleep_time and wake_time (see the sketch below) | sleep = health = memory + attendance = good score |
privilege based on number of siblings and tuition | ((number_of_siblings = large) + (tuition = none)) = underprivileged |
female_class and male_class to identify single-sex classes | (good single-sex schools in Singapore = potential indicator of test scores)/(single-sex schools = more focussed students) |
class_size based on n_male and n_female | larger classes = less attention from teachers and less resources from school = less support from school |
Note that all decisions were supported by visualizations and statistics (in EDA) on top of the logic.
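A minimal sketch of the sleep_hours logic, assuming 'HH:MM' strings as in 'prediction_parameters' (the in-pipeline implementation may differ):

```python
# Minimal sketch, assuming pandas; computes hours slept, wrapping
# across midnight when wake_time is earlier than sleep_time.
import pandas as pd

def sleep_hours(sleep_time: str, wake_time: str) -> float:
    sleep = pd.to_datetime(sleep_time, format="%H:%M")
    wake = pd.to_datetime(wake_time, format="%H:%M")
    hours = (wake - sleep).total_seconds() / 3600
    return hours if hours > 0 else hours + 24  # wrap across midnight

print(sleep_hours("22:00", "6:00"))  # 8.0
```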
The following engineered features were considered but ultimately excluded:

Engineered Feature | Logic |
---|---|
gender_ratio in class based on n_male and n_female | (unbalanced class gender might affect attention)/(alternative logic: unbalanced class gender might affect resource distribution) |
n_male_cat and n_female_cat categories based on number of students of specific gender in class | capture class sizes categorically, but too closely related to n_male and n_female |
This dataset has moderate dimensionality, a small data size, data on different scales (which can be scaled if desired), many numpy.zeros, and around 5% missing values (in non-label features) which may or may not benefit from imputation in the context of this problem. Additionally, there is the requirement for both a regression model and a classification model.
XGBoost is a good candidate for dealing with the above conditions given that it is not sensitive to scale, is efficient at storing numpy.zeros, is able to ignore NaN data while still making use of other data within the row, and has both a regression model and a classification model ready for parameter tuning.
First off, this is a supervised learning problem: the goal is to use the given data and the corresponding labels to build a model capable of predicting outputs for new sets of data, which means unsupervised learning models do not need to be considered. Secondly, the moderate dimensionality and the many non-linear relationships between/within features are signs that linear models should not be considered. Single decision trees are good for modelling low-dimensionality, simple problems within a limited space, but quickly become ineffective once the problem has a moderate number of parameters. Lastly, predicting exam scores can be extremely complex, as exams are not environments of consistency. Even if a model has successfully identified that a student was supposed to perform well based on the features, it is possible that in the final exam the student fumbles due to stress, carelessness or an inability to focus. Hence, a decision tree ensemble or forest is required that is capable of modelling the complexity of the problem while maintaining a high degree of flexibility for regularization, so as not to overfit on features that have an inherently unpredictable output. This is where XGBoost comes in!
XGBoost (Extreme Gradient Boosting) uses an ensemble of gradient boosted trees to model the problem. XGBoost offers a good number of parameters to tune: the number of trees, the depth of the trees and the number of leaves each tree can have can be adjusted to suit the complexity of the model (more complex models require more trees and a larger depth to model appropriately, simply because there are more parameters to factor into each tree). More importantly for the context of this problem, the ways in which the model can be regularized are abundant: from the selection of random subsamples (rows) for the training of each tree, the selection of a fraction of columns for the training of each level of the tree, the minimum amount of gain (also known as gamma or loss reduction) required for a tree to create an additional branch, and the minimum total number of data instances in each leaf to be considered for branching, down to the lambda and alpha regularization terms which affect the gain calculation formula at each step of tree branching (thereby tuning how 'finely' each tree branches out) - the model has the right amount of regularization capabilities for this problem.
For the classification model, a label needed to be generated based on the data. Since education is about equalizing, but resources are limited and resource allocation is about optimization, the students were labelled based on scoring percentiles. Those scoring below a certain percentile are considered as 'requiring support/attention'. This makes more sense than setting a raw score threshold because resources should be allocated in proportion to neediness, yet the number of needy students a school can support is in reality limited. Hence schools should focus on those performing worst in their school rather than on everyone performing below some specific score (as the latter might diffuse attention and resources away from those who need them most).
For this project, the percentiles used to label the students (the label was named final_grade) are as follows:
final_grade | Percentile | Score Band |
---|---|---|
1 | 0.1 | [0 - 48] |
2 | 0.3 | (48 - 58) |
3 | 0.6 | [58 - 71) |
4 | 1.0 | [71 - 100] |
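A minimal sketch of how such percentile banding could be generated with pandas (toy scores; not the exact code used to build the label):

```python
# Minimal sketch, assuming a 'final_test' score column; the percentile
# thresholds follow the table above and labels 1-4 map to the bands.
import pandas as pd

scores = pd.Series([30, 50, 60, 75, 90, 45, 66, 82])  # toy final_test scores
q10, q30, q60 = scores.quantile([0.1, 0.3, 0.6])      # percentile thresholds

final_grade = pd.cut(
    scores,
    bins=[-float("inf"), q10, q30, q60, float("inf")],
    labels=[1, 2, 3, 4],
)
print(final_grade.value_counts().sort_index())
```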
There was insufficient time to run a thorough grid search cross validation for the models to tune the parameters. However, without tuning, the vanilla XGBRegressor() and XGBClassifier() models produced reasonable results during K-Fold Cross Validation and Testing.
Metric | XGBRegressor() | XGBClassifier() |
---|---|---|
Mean Squared Error (squared error between prediction and actual value) | 28 - 32 | N/A |
Accuracy (number of correctly predicted labels / number of tested labels) | N/A | 78 - 80% |
Based on experiments in the EDA, the sensitivity when classifying the final_grade=1 students was 70.7%, which is not good enough for deployment. The full confusion matrix is as seen below:
From the confusion matrix above, it is apparent that the model shows a bias towards classifying students into higher grades.
We can see that the model has difficulty classifying students into their exact final_grade category (e.g. for final_grade=1 students, only 104/147 (70.7%) were correctly categorized - it is not very sensitive to the performance of final_grade=1 students). Given that we know the number of students belonging to final_grade=1 from the percentiles set for the grade thresholds, there is sufficient data in terms of volume (relative to the dataset given) to characterize a final_grade=1 student. The poor sensitivity could simply mean it is harder to predict students who are going to perform poorly than those who will do well (final_grade=4 prediction has a sensitivity of 538/612 (87.9%)). Or it could mean that the quality of data collected for students belonging to final_grade=1 is poorer (e.g. false data being fed in on the number of hours studied per week, etc.). Among all the metrics with which to analyze the confusion matrix, sensitivity is the most relevant, as the school's top priority is to prevent students from falling through the cracks (in this case, being falsely classified as negative). To improve the sensitivity towards the characteristics of final_grade=1 students, it would be good to either increase the quantity of data from final_grade=1 students in the dataset or to improve the quality of data collected from these students. It would also be good to collect data specific to the identification of final_grade=1 students - perhaps something like 'detentions_received'. From a model-side perspective, identifying the features that help differentiate a final_grade=1 student from other students (if such a feature exists) and assigning them a larger weight would address this issue.
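A minimal sketch of the per-class sensitivity (recall) calculation described above, using toy labels in place of the real test set:

```python
# Minimal sketch, assuming scikit-learn; y_true/y_pred are toy stand-ins
# for the real test labels and model predictions.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 2, 2, 3, 4, 4])
y_pred = np.array([1, 2, 1, 2, 2, 3, 4, 4])

cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4])
# Rows are true grades; the diagonal holds correct classifications.
sensitivity = cm.diagonal() / cm.sum(axis=1)   # TP / (TP + FN) per grade
print(dict(zip([1, 2, 3, 4], sensitivity.round(3))))
```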
If more time were available, experimenting with the exact percentiles that best split the thresholds, and the number of thresholds to create, would be useful as well - although this is likely to differ from dataset to dataset, as it will from school to school. For example, if the final_grade=1 percentile threshold is too high, the characteristics of students who truly need help will be mixed with those who are on the borderline, or perhaps even just average. For this dataset, the threshold corresponds to those who score 48 marks and below on the final exam, which is reasonable.
However, knowing that the model has a bias towards giving students a higher grade allows the user to take advantage of this fact in one simple way: take both final_grade=1 and final_grade=2 students as those who should be focused on. Based on the split above, doing so will capture 93.2% (137/147) of the students who require assistance (based on current percentile assumptions), which shows good potential for a model that has not been tuned.
For the context of this problem, the XGBRegressor() is slightly easier to deploy in terms of resource allocation. Given a set of data from a population of students, the school need only run a regression on all the students' data to rank the students individually across the entire dataset. The school can then select students individually for focussed attention. This model is much more useful for detecting outliers than for banding students for different treatment. For example, using the regressor, the school can pick the 'worst 3 scorers' in advance and focus attention on them, e.g. 1-1 consultations. On the reverse side, the school can also use this model to identify the 'top 3 performers' in advance, perhaps to select them for a competition or program.
The deployment of the regressor model can enable schools to make targeted efforts involving a few students at the score extremes after regression. Alternatively, if the same banding principle used for generating the classification model labels is applied to the scores predicted by regression, the school can convert the individual scores into different bands and provide attention and resources to the students accordingly.
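A minimal sketch of this ranking workflow, with toy predictions standing in for model.predict() output:

```python
# Minimal sketch; the toy array stands in for model.predict(X_students)
# on a fitted XGBRegressor.
import numpy as np

pred = np.array([72.1, 45.3, 88.0, 51.7, 93.4, 40.2])  # toy predicted scores

order = np.argsort(pred)        # ascending: worst predicted scorers first
worst_three = order[:3]         # candidates for focussed attention, e.g. 1-1s
top_three = order[-3:][::-1]    # candidates for competitions or programs
print(worst_three, top_three)   # [5 1 3] [4 2 0]
```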
Predicting scores can be extremely difficult as exams are not the best environment for consistency. Even if a model has successfully identified that a student should perform well, it is possible that in the final exam the student fumbles due to stress, carelessness or an inability to focus. If the model is expected to capture this information as well, it would be good to take multiple test scores and consolidate their average and variance as a proxy for performance consistency, which can then become a feature for training an improved model that factors in consistency.
Thank you for taking the time to read this document.
Pre-requisites for running the EDA Jupyter Notebook